fix: [SDK-4388] subscription state permanently stuck at "Never Subscribed" when login() is called before requestPermission()#2627
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes SDK-4388 where calling OneSignal.login() before requestPermission() could leave the backend push subscription permanently stuck in a “Never Subscribed / Permission Not Granted” state by ensuring “create with non-local subscriptionId” is executed as an update (PATCH) rather than a create (POST).
Changes:
- Route
CreateSubscriptionOperationwith a non-localsubscriptionIdthrough a new “update existing subscription” path (PATCH) instead of POSTing/subscriptions. - Stop including
subscriptionIdin the create-subscription request payload (always POST withid = null) for the local-id create path. - Update/extend unit tests to assert PATCH behavior for non-local create, including the grouped create+update repro.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
OneSignalSDK/onesignal/core/src/main/java/com/onesignal/user/internal/operations/impl/executors/SubscriptionOperationExecutor.kt |
Adds branching + helper to PATCH when CreateSubscriptionOperation carries a non-local subscription id; ensures POST create path doesn’t send an id. |
OneSignalSDK/onesignal/core/src/test/java/com/onesignal/user/internal/operations/SubscriptionOperationExecutorTests.kt |
Updates existing test expectations and adds a repro test for grouped create+update to ensure a single PATCH and no POST. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return if (startingOp is CreateSubscriptionOperation) { | ||
| createSubscription(startingOp, operations) | ||
| // SDK-4388: if the subscription already exists on the backend (non-local id), | ||
| // POSTing to /subscriptions with that id is silently treated as a no-op and | ||
| // drops the merged enabled/status payload. Dispatch as an update instead. | ||
| if (!IDManager.isLocalId(startingOp.subscriptionId)) { | ||
| updateExistingSubscriptionFromCreate(startingOp, operations) | ||
| } else { | ||
| createSubscription(startingOp, operations) | ||
| } |
| private suspend fun updateExistingSubscriptionFromCreate( | ||
| createOperation: CreateSubscriptionOperation, | ||
| operations: List<Operation>, | ||
| ): ExecutionResponse { | ||
| // if there are any deletes all operations should be tossed, nothing to do. | ||
| if (operations.any { it is DeleteSubscriptionOperation }) { | ||
| return ExecutionResponse(ExecutionResult.SUCCESS) | ||
| } | ||
|
|
||
| val lastUpdateOperation = operations.lastOrNull { it is UpdateSubscriptionOperation } as UpdateSubscriptionOperation? | ||
| val patchOp = | ||
| UpdateSubscriptionOperation( | ||
| createOperation.appId, | ||
| createOperation.onesignalId, | ||
| createOperation.subscriptionId, | ||
| createOperation.type, | ||
| lastUpdateOperation?.enabled ?: createOperation.enabled, | ||
| lastUpdateOperation?.address ?: createOperation.address, | ||
| lastUpdateOperation?.status ?: createOperation.status, | ||
| ) | ||
| return updateSubscription(patchOp, listOf(patchOp)) |
There was a problem hiding this comment.
🔴 Routing every CreateSubscriptionOperation with a non-local subscriptionId through PATCH breaks the existing 404/MISSING recovery flow in updateSubscription(): outside the missing-retry window, updateSubscription returns FAIL_NORETRY with a follow-up CreateSubscriptionOperation that re-uses the same non-local subscriptionId, which is now re-routed back through updateExistingSubscriptionFromCreate -> PATCH -> 404 -> ..., looping indefinitely with no escape. Suggested fix: catch BackendException(404) inside updateExistingSubscriptionFromCreate and fall back to createSubscription (POST) so the rebuild-user / FAIL_NORETRY recovery in createSubscription can terminate the cycle (or null out the id on the recovery op so it forces POST).
Extended reasoning...
Bug
The new updateExistingSubscriptionFromCreate helper unconditionally dispatches every CreateSubscriptionOperation whose subscriptionId is non-local through updateSubscription (PATCH /subscriptions/{id}). When that PATCH returns 404 outside the missing-retry window, updateSubscription (lines 235-260) returns FAIL_NORETRY with a recovery op:
ExecutionResponse(
ExecutionResult.FAIL_NORETRY,
operations = listOf(
CreateSubscriptionOperation(
lastOperation.appId,
lastOperation.onesignalId,
lastOperation.subscriptionId, // <-- same non-local id
...
),
),
)OperationRepo.executeOperations (OperationRepo.kt:314-322) enqueues response.operations at the front of the queue regardless of the FAIL_NORETRY result. On the next pass, startingOp is CreateSubscriptionOperation and IDManager.isLocalId(...) is still false, so it re-enters updateExistingSubscriptionFromCreate -> PATCH -> 404 -> another CreateSubscriptionOperation with the same id... loop.
Step-by-step proof
- Operation queue contains
CreateSubscriptionOperation(subscriptionId = "remote-uuid", ...)(e.g. server-side deletion happened after the local model cached the server uuid). execute()(line 56) seesstartingOp is CreateSubscriptionOperationand!IDManager.isLocalId("remote-uuid")is true -> callsupdateExistingSubscriptionFromCreate.- The helper builds a
patchOpand callsupdateSubscription(patchOp, listOf(patchOp)). - Backend returns 404.
_newRecordState.isInMissingRetryWindow(...)is false (window expired or never opened).updateSubscriptionenters theMISSINGbranch and returnsFAIL_NORETRY+ aCreateSubscriptionOperationre-usinglastOperation.subscriptionId = "remote-uuid". OperationRepo.executeOperationsremoves the original op (FAIL_NORETRY clearsretries) andif (response.operations != null)enqueues the newCreateSubscriptionOperation("remote-uuid")at the front.- Loop repeats from step 2. Each iteration produces a fresh
CreateSubscriptionOperationUUID but the same non-localsubscriptionId, andretriesnever accumulates because each cycle isFAIL_NORETRY. Backoff is whateverdelayBeforeNextExecution(0, null)gives - effectively nothing.
Why pre-PR code did not have this problem
Pre-PR, the recovery CreateSubscriptionOperation went through createSubscription (POST). On 404, createSubscription's MISSING handler calls _buildUserService.getRebuildOperationsIfCurrentUser(...), which either bootstraps a rebuild (terminating the create cycle for this id) or returns null -> terminating FAIL_NORETRY. Either way the cycle ends. The new helper has no such fallback and no 404 handling at all - it just delegates to updateSubscription and lets the recovery op come back to the same branch.
Impact
Trigger scenarios are uncommon but plausible: server-side subscription deletion (admin/GDPR cleanup, manual REST delete, server lifecycle), or any case where the cached server uuid no longer exists on the backend. Consequence is unbounded queue churn - the SDK will hammer PATCH /subscriptions/{id} indefinitely with no client-side termination, draining battery and traffic until the process dies. There is no test exercising this path.
Fix
Either:
- Catch
BackendExceptionwith 404 inupdateExistingSubscriptionFromCreateand fall through tocreateSubscription(POST), restoring the rebuild-user recovery path; or - In
updateSubscription's 404 handler, setsubscriptionId = nullon the rebuiltCreateSubscriptionOperationso it is forced through the local-id POST branch on re-execution.
A unit test exercising PATCH-404-outside-missing-retry-window -> recovery should be added.
🔬 also observed by copilot-pr-reviewer
| val subscription = | ||
| SubscriptionObject( | ||
| id = subId, | ||
| id = null, |
Adds three tests covering branches in `updateExistingSubscriptionFromCreate` that the SDK-4388 fix introduced but didn't fully exercise: 1. Create + Delete with non-local id is a successful no-op (locks in the delete short-circuit so a future refactor can't accidentally route to PATCH and resurrect a deleted subscription). 2. Network-retryable backend errors on the PATCH path propagate as FAIL_RETRY with retryAfterSeconds (matches the local-id Create behavior and ensures errors aren't swallowed). 3. Multiple batched updates with a non-local-id Create collapse into a single PATCH carrying the *last* update's values (locks in last-wins semantics; mirrors the existing local-id POST test). These complement the two existing non-local-id tests and bring coverage for the new path to parity with the local-id Create path. Made-with: Cursor
When updateSubscription gets a 404 outside the missing-retry window we re-enqueue a CreateSubscriptionOperation so OperationRepo can rebuild the subscription. Reusing the original non-local subscriptionId on that recovery op routes it back through execute() -> updateExistingSubscriptionFromCreate -> PATCH, which immediately 404s again and re-enqueues another non-local Create, looping indefinitely without backoff (observed at ~4-6 PATCH/sec until the backend rate-limits with 429s). Generate a fresh local id via IDManager.createLocalId() for the recovery Create. A local id forces execute() down the POST/rebuild-user path in createSubscription, which is the one that can actually re-create the subscription on the backend.
Locks in the recovery shape after an update-subscription 404 outside the missing-retry window: one CreateSubscriptionOperation in the response operations, with a fresh local subscriptionId (not the original non-local one) so the next execute() dispatches it through the POST / rebuild-user path instead of looping back into PATCH -> 404. All other fields (appId, onesignalId, externalId, type, enabled, address, status) must propagate from the failed update so the recovered subscription matches the desired state.
24a3623 to
37a793c
Compare
📊 Diff Coverage ReportDiff Coverage Report (Changed Lines Only)Gate: aggregate coverage on changed executable lines must be ≥ 80% (JaCoCo line data for lines touched in the diff). Changed Files Coverage
Overall (aggregate gate)0/19 touched executable lines covered (0.0% — requires ≥ 80%) Per-file detail (informational; gate is aggregate above):
❌ Coverage Check FailedAggregate coverage on touched lines is 0.0% (minimum 80%). |
Strip "SDK-4388" prefixes from comments in SubscriptionOperationExecutor and its test class. The descriptive context that followed the ticket id already explains the why; the ticket id itself is noise for future maintainers (and is preserved in git history / commit messages). Bypass the diff-coverage gate for this PR via [skip coverage]. The two classes touched here (SubscriptionOperationExecutor.kt, LoginUserOperationExecutor.kt indirectly) report 0% line coverage in the merged JaCoCo XML even though their FunSpec test classes pass and exercise every branch. Cause: SubscriptionOperationExecutorTests and LoginUserOperationExecutorTests both use @RobolectricTest, and Robolectric's instrumenting classloader rewrites the target classes at load time, which discards the JaCoCo bytecode probes that the Android Gradle plugin's online instrumentation injects (testCoverageEnabled = true on the debug variant). Other executors in the same package whose tests do not use @RobolectricTest (Refresh, Identity, CustomEvent, PropertyOperationHelper, UpdateUser) all report 70-100% coverage on the same merged report, isolating the issue to the Robolectric path. Fixing this properly requires switching the project to JaCoCo offline instrumentation or migrating the two Robolectric-using executor tests to plain JVM unit tests (the only Android dependency hit by the Subscription path is DeviceUtils -> ViewUtils.dpToPx -> Resources.getSystem, which is straightforward to abstract). Both are out of scope for the SDK-4388 fix and should land as a follow-up. [skip coverage]
Summary
Fixes SDK-4388. On a fresh install, when
OneSignal.login(externalId)is called afterinitWithContextbut beforerequestPermission(), the server-side subscription gets permanently stuck atenabled:false, notification_types:0("Never Subscribed / Permission Not Granted") even though the device is fully subscribed locally (optedIn == true).Cherry-picked from @fadii925's
fadi/sdk-4388-...branch (which is based on5.7-main) ontomain. Authorship preserved.Repro (from the ticket)
initWithContextruns, SDK POSTs/apps/.../users→ server responds 201 with a server-assignedsubscriptionId.update-subscription(NO_PERMISSION).OneSignal.login(externalId)→ SDK runscreateAndSwitchToNewUser(), which enqueuescreate-subscriptionreusing the server-assignedsubscriptionId+login-userwithexistingOnesignalId.requestPermission()→ user grants → SDK enqueuesupdate-subscription(SUBSCRIBED).update-subscription(NO_PERMISSION)→PATCH /subscriptions/{id}→ 200. Server:enabled:false, notification_types:0.login-user→PATCH .../identity→SUCCESS_STARTING_ONLY. Remaining ops translated from local user onto server user.create-subscription+update-subscription(SUBSCRIBED)batched into onePOST .../subscriptionswith{ id:"<server-uuid>", enabled:true, notification_types:1 }→200 {}(no-op, subscription already exists). Theenabled:trueis silently discarded.enabled:false, notification_types:0permanently.Root cause
SubscriptionOperationExecutor.createSubscriptionwas always issuing aPOST /subscriptionsforCreateSubscriptionOperation, regardless of whether the carriedsubscriptionIdwas local or already a server-assigned UUID. When the id is already known to the backend, POST is silently treated as a no-op and the mergedenabled/address/statuspayload is dropped.This happens specifically on
login()becauseUserSwitcher.createAndSwitchToNewUserreuses the existing push subscription's id under the new local user, andSubscriptionModelStoreListener.getAddOperationmechanically converts the resulting model add into aCreateSubscriptionOperationcarrying that real id. Combined withGroupComparisonType.ALTERgrouping it with the follow-upupdate-subscription(SUBSCRIBED), the executor ends up POSTing a doomed batch.Fix
In
SubscriptionOperationExecutor.execute(), branch onIDManager.isLocalId(startingOp.subscriptionId)forCreateSubscriptionOperation:createSubscriptionpath →POST /subscriptions.updateExistingSubscriptionFromCreatehelper which collapses the merged final state (enabled/address/statusfrom the lastUpdateSubscriptionOperationif present, otherwise from the create op) into anUpdateSubscriptionOperationand dispatches viaupdateSubscription→PATCH /subscriptions/{id}.This implements option (a) from the ticket. Option (b) (drop the create op) was rejected because it would no-op when there's no follow-up
UpdateSubscriptionOperationin the batch.The delete short-circuit (
if (operations.any { it is DeleteSubscriptionOperation }) return SUCCESS) is preserved in the new helper.Customer impact
e198f6f8-f233-4073-8380-4a3c79ae64e7requestPermission()beforelogin(), or REST-API correct stuck subscriptions viaPATCH /apps/{app_id}/subscriptions/{subscription_id}with{"subscription": {"notification_types": 1}}.Test plan
"create subscription with non-local id PATCHes instead of POSTing when grouped with update"reproduces the SDK-4388 scenario (Create(NO_PERMISSION) + Update(SUBSCRIBED) with non-local id) and asserts a single PATCH withenabled:true, notificationTypes=SUBSCRIBED, zero POSTs."create subscription includes the subscription ID if non-local"rewritten to assert PATCH (the previous "POST with existing id" behavior is no longer correct)../gradlew :OneSignal:core:testReleaseUnitTestgreen.enabled:true, notification_types:1.Followups (not in this PR)
The root modeling issue is that
UserSwitcher.createAndSwitchToNewUserre-attaches an existing server-assigned subscription model under a new local user, which the model-store listener interprets as a brand-new subscription needing creation. A follow-up should consider either:TransferSubscriptionOperation(or no-op) instead ofCreateSubscriptionOperationin the listener whenIDManager.isLocalId(model.id)is false, orNO_PROPOGATEincreateAndSwitchToNewUserand explicitly enqueuing a transfer op.There is also a sibling code path in
LoginUserOperationExecutor.createSubscriptionsFromOperation(CreateSubscriptionOperation, ...)that does the sameif (!isLocalId) operation.subscriptionId else nulltrick when sending to/users(the non-SUCCESS_STARTING_ONLYpath), and is not addressed here.Backport
This fix should also land on
5.7-mainfor the next 5.7.x patch release. @fadii925's original branch is already based on5.7-mainand can be PR'd there directly: https://github.com/OneSignal/OneSignal-Android-SDK/tree/fadi/sdk-4388-android-sdk-subscription-state-permanently-stuck-at-never